In this section, we use the diabetes dataset to illustrate how to build and interpret various models with the assistance of AI. We begin with the familiar linear regression model, which is widely known and well understood. Next, we turn to the decision tree model, which, while new to some, aligns with our intuition; decision trees find practical application in areas such as loan approval in banks and patient diagnosis. Finally, we explore a more intricate mathematical model, highlighting the potential of AI in addressing unfamiliar modeling scenarios.
Prompt:
Show a few rows of the diabetes dataset in scikit-learn as a table.
from sklearn.datasets import load_diabetes
import pandas as pd
# Load the diabetes dataset
diabetes = load_diabetes()
# Create a pandas DataFrame from the dataset
data = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
target = pd.DataFrame(data=diabetes.target, columns=['target'])
df = pd.concat([data, target], axis=1)
# Display the first few rows as a table
display(df.head())
| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005670 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.025930 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |
df.shape[0]*0.8
353.6
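The value 353.6 above is 80% of the 442 rows; since row counts must be integers, `train_test_split` rounds the test-set size up. A quick sketch to check the actual split sizes:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# 80% of 442 is 353.6; scikit-learn takes ceil(442 * 0.2) = 89 rows for the
# test set, leaving 353 rows for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 353 89
```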
Linear Regression¶
Prompt:
Use the diabetes data from scikit-learn. Preprocess the data as follows:
1) Load the data and assign it to X, and the target to y. The target variable represents disease progression.
2) Rename the columns to make them more understandable, as shown here:{'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure', 's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL', 's4': 'Thyroid', 's5': 'Glaucoma', 's6': 'Glucose'}In this context, BP stands for blood pressure, and BMI refers to body mass index. The other six columns correspond to blood sample measurements.
Perform the following steps to analyze the data:
1) Plot the correlation matrix of features as a heatmap.
2) Split the data into training and test sets, allocating 80% of the data for training and 20% for testing.
3) Utilize a linear regression model to predict the target variable.
4) Evaluate the model's performance using the mean absolute error (MAE) metric.
5) Compare the predictions of the model with a baseline model that predicts the median diabetes progression (y) across all patients. Calculate the MAE of this baseline model. Print out both MAEs.
Show python code.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
import numpy as np
# Load the diabetes dataset
diabetes = load_diabetes()
# Assign features (X) and target (y)
X = diabetes.data
y = diabetes.target
# Rename the columns
column_names = {'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure',
's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL',
's4': 'Thyroid', 's5': 'Glaucoma', 's6': 'Glucose'}
df = pd.DataFrame(X, columns=[column_names[col] for col in diabetes.feature_names])
df['Disease Progression'] = y
# Show a few rows of the diabetes data (features and target in a table)
print("Diabetes Data (Features and Target):")
display(df.head())
# Plot the correlation matrix as a heatmap
correlation_matrix = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Diabetes Data - Correlation Matrix")
plt.show()
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Utilize linear regression to predict the target variable
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate the model's performance using MAE
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
# Calculate the MAE for the baseline model
baseline_pred = [np.median(y)] * len(y_test)
baseline_mae = mean_absolute_error(y_test, baseline_pred)
# Print MAEs
print(f"MAE of the linear regression model: {mae:.2f}")
print(f"MAE of the baseline model: {baseline_mae:.2f}")
Diabetes Data (Features and Target):
| | Age | Sex | BMI | Blood Pressure | Total Cholesterol | LDL | HDL | Thyroid | Glaucoma | Glucose | Disease Progression |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005670 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.025930 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |
MAE of the linear regression model: 42.79
MAE of the baseline model: 62.59
This is a good improvement over the baseline, but the MAE alone tells us nothing about the model's decision-making process or which input features contribute most to the prediction.
Explaining the algorithm does not, by itself, provide this understanding. In general, the ability to describe and explain a machine learning algorithm does not guarantee that a fitted model is interpretable.
In the case of linear regression, the final prediction is indeed obtained by taking a weighted sum of the input features. By examining the learned weights, we can determine the impact or importance of each feature in predicting the target variable. This can be achieved by considering the absolute value of the corresponding weight.
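The weighted-sum structure can be verified directly: a minimal sketch showing that reconstructing a prediction by hand from `intercept_` and `coef_` matches `model.predict`.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
model = LinearRegression().fit(X, y)

# A linear regression prediction is just the intercept plus a weighted sum
# of the input features, with the learned coefficients as weights.
x0 = X[0]
manual = model.intercept_ + np.dot(model.coef_, x0)
print(manual, model.predict(X[:1])[0])  # the two values agree
```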
Prompt:
Analyze the impact of each feature in a linear regression model in the following steps:
1) Retrieve the weights assigned to each feature.
2) Create a plot of the weights, where the feature names are displayed on the y-axis with different colors. The feature names should be arranged in increasing order of the absolute values of their corresponding weights.
Show python code.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
# Step 1: Load the data and assign it to X and y
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target
# Step 2: Rename the columns
column_names = {
'age': 'Age',
'sex': 'Sex',
'bmi': 'BMI',
'bp': 'Blood Pressure',
's1': 'Total Cholesterol',
's2': 'LDL',
's3': 'HDL',
's4': 'Thyroid',
's5': 'Glaucoma',
's6': 'Glucose'
}
X.rename(columns=column_names, inplace=True)
# Fit a linear regression model
model = LinearRegression()
model.fit(X, y)
# Step 1: Retrieve the weights assigned to each feature
feature_weights = model.coef_
# Step 2: Create a plot of the weights, arranged by the absolute values of weights
sorted_indices = np.argsort(np.abs(feature_weights))
sorted_feature_names = X.columns[sorted_indices]
sorted_weights = feature_weights[sorted_indices]
plt.figure(figsize=(8, 6))
plt.barh(range(len(sorted_feature_names)), sorted_weights, color=['r' if w < 0 else 'g' for w in sorted_weights])
plt.yticks(range(len(sorted_feature_names)), sorted_feature_names)
plt.xlabel('Weight')
plt.ylabel('Feature')
plt.title('Impact of Features in Linear Regression')
plt.show()
Try to interpret the model.
Decision Tree¶
A decision tree is a machine learning algorithm that can model complex nonlinear relationships. It typically has higher predictive power than linear regression while remaining highly interpretable.
Prompt:
Use the diabetes data from scikit-learn. Preprocess the data as follows:
1) Load the data and assign it to X, and the target to y. The target variable represents disease progression.
2) Rename the columns to make them more understandable, as shown here:{'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure', 's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL', 's4': 'Thyroid', 's5': 'Glaucoma', 's6': 'Glucose'}In this context, BP stands for blood pressure, and BMI refers to body mass index. The other six columns correspond to blood sample measurements.
Perform the following steps to analyze the data:
1) Split the data into training and test sets, allocating 80% of the data for training and 20% for testing.
2) Utilize a decision tree model to predict the target variable.
3) Evaluate the model's performance using the mean absolute error (MAE) metric.
4) Visualize the decision tree.
5) Plot the feature importance for the diabetes dataset using the decision tree model. Display the feature names on the y-axis with different colors. Arrange the feature names in increasing order of importance.
Show python code.
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt
from sklearn import tree
# Load the diabetes dataset
diabetes = load_diabetes()
# Assign features (X) and target variable (y)
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target
# Rename the columns
column_names = {'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure',
's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL',
's4': 'Thyroid', 's5': 'Glaucoma', 's6': 'Glucose'}
X.rename(columns=column_names, inplace=True)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a decision tree model
model = DecisionTreeRegressor(max_depth=3, random_state=42)
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model's performance using mean absolute error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")
# Visualize the decision tree
plt.figure(figsize=(12, 8))
tree.plot_tree(model, feature_names=X.columns, filled=True)
plt.show()
# Plot the feature importance
feature_importance = model.feature_importances_
sorted_indices = np.argsort(feature_importance)
sorted_features = X.columns[sorted_indices]
plt.figure(figsize=(10, 6))
plt.barh(sorted_features, feature_importance[sorted_indices])
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.title('Feature Importance for Diabetes Dataset')
plt.show()
### ChatGPT may ignore the requirement to use different colors
### User code to generate a colored chart
# feature_importance = model.feature_importances_
# sorted_indices = np.argsort(feature_importance)
# sorted_features = X.columns[sorted_indices]
# import matplotlib
# sorted_importances = feature_importance[sorted_indices]
# cmap = matplotlib.colormaps['viridis']
# colors = cmap(sorted_importances)
# plt.figure(figsize=(10, 6))
# plt.barh(sorted_features, feature_importance[sorted_indices], color=colors)
# plt.xlabel('Feature Importance')
# plt.ylabel('Features')
# plt.title('Feature Importance for Diabetes Dataset')
# plt.show()
Mean Absolute Error (MAE): 48.0965923002249
Tasks:
1) Set max_depth=3 as follows: model = DecisionTreeRegressor(max_depth=3, random_state=42) to understand the prediction of a decision tree, which consists of a series of if-else conditions.
2) Set max_depth=None to understand the feature importance.
3) Compare the MAEs of the two settings above to understand overfitting; the decision tree model is prone to overfitting. The value of samples in a node represents the number of training data points that reach that particular node. In our context, the samples value indicates the number of patients whose data is used to make the prediction at that node.
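The tasks above can be sketched in one cell: print the shallow tree's if-else rules with scikit-learn's `export_text` helper, then contrast train and test MAE for the two depth settings (the unrestricted tree fits the training data nearly perfectly but generalizes worse, the classic sign of overfitting).

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text
from sklearn.metrics import mean_absolute_error

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

shallow = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_train, y_train)
deep = DecisionTreeRegressor(max_depth=None, random_state=42).fit(X_train, y_train)

# The shallow tree printed as explicit if-else rules
print(export_text(shallow, feature_names=list(diabetes.feature_names)))

# Train vs. test MAE for each depth setting
for name, m in [("max_depth=3", shallow), ("max_depth=None", deep)]:
    train_mae = mean_absolute_error(y_train, m.predict(X_train))
    test_mae = mean_absolute_error(y_test, m.predict(X_test))
    print(f"{name}: train MAE {train_mae:.2f}, test MAE {test_mae:.2f}")
```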
Generalized Additive Models (GAMs)¶
Generalized Additive Models (GAMs) are models with moderate to high predictive power that remain highly interpretable. They can effectively model nonlinear relationships between the features and the target variable by fitting a smooth function to each feature and summing the results.
Prompt:
Use the diabetes data from scikit-learn. Preprocess the data as follows:
1) Load the data and assign it to X, and the target to y. The target variable represents disease progression.
2) Rename the columns to make them more understandable, as shown here:{'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure', 's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL', 's4': 'Thyroid', 's5': 'Glaucoma', 's6': 'Glucose'}In this context, BP stands for blood pressure, and BMI refers to body mass index. The other six columns correspond to blood sample measurements.
Perform the following steps to analyze the data:
1) Split the data into training and test sets, allocating 80% of the data for training and 20% for testing.
2) Utilize a generalized additive model (GAM) to predict the target variable.
3) Evaluate the model's performance using the mean absolute error (MAE) metric.
4) Plot the effect of each of the patient features on the target variable with 95% confidence intervals.
Show python code.
It can be hard to get executable, error-free code from ChatGPT for this task. You can simply copy/paste the error message back to ChatGPT and ask it to correct the code, or ignore step 4) at first to simplify the task. Here are a few notes on the code ChatGPT provided. In building the GAM model, we use
gam = LinearGAM(s(0) + f(1) + s(2) + s(3) + s(4) + s(5) + s(6) + s(7) + s(8) + s(9), n_splines=35)
with f(1) instead of s(1) because the feature Sex is categorical.
When training the model, instead of
gam.fit(X_train, y_train)
we can use
gam.gridsearch(X_train, y_train)
to perform training and determine optimal smoothing parameters.
# you may need to run this cell twice
!pip install pygam
Requirement already satisfied: pygam in /usr/local/lib/python3.10/dist-packages (0.9.0) Requirement already satisfied: numpy<2.0.0,>=1.24.2 in /usr/local/lib/python3.10/dist-packages (from pygam) (1.25.1) Requirement already satisfied: progressbar2<5.0.0,>=4.2.0 in /usr/local/lib/python3.10/dist-packages (from pygam) (4.2.0) Requirement already satisfied: scipy<2.0.0,>=1.10.1 in /usr/local/lib/python3.10/dist-packages (from pygam) (1.10.1) Requirement already satisfied: python-utils>=3.0.0 in /usr/local/lib/python3.10/dist-packages (from progressbar2<5.0.0,>=4.2.0->pygam) (3.7.0) Requirement already satisfied: typing-extensions>3.10.0.2 in /usr/local/lib/python3.10/dist-packages (from python-utils>=3.0.0->progressbar2<5.0.0,>=4.2.0->pygam) (4.7.1)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from pygam import LinearGAM, s, f, l
from sklearn.metrics import mean_absolute_error
# Step 1: Load the data and assign it to X, and the target to y
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
# Step 2: Rename the columns
column_names = {'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure',
's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL', 's4': 'Thyroid',
's5': 'Glaucoma', 's6': 'Glucose'}
feature_names = [column_names[col] for col in diabetes.feature_names]
# Step 3: Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Utilize a generalized additive model (GAM) with grid search for hyperparameter tuning
gam = LinearGAM(s(0) + f(1) + s(2) + s(3) + s(4) + s(5) + s(6) + s(7) + s(8) + s(9), n_splines=35)
# Step 5: Fit the model on the training data
# gam.gridsearch(X_train, y_train)
gam.fit(X_train, y_train)
# Step 6: Evaluate the model's performance using mean absolute error (MAE) metric
y_pred = gam.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae:.2f}")
# Step 7: Plot the effect of each patient feature on the target variable with 95% confidence intervals
fig, axs = plt.subplots(2, 5, figsize=(15, 6))
axs = axs.ravel()
for i in range(X.shape[1]):
XX = gam.generate_X_grid(term=i)
partial_dependence, confidence_intervals = gam.partial_dependence(term=i, X=XX, width=0.95)
axs[i].plot(XX[:, i], partial_dependence, color='blue')
axs[i].fill_between(XX[:, i], confidence_intervals[:, 0], confidence_intervals[:, 1], color='lightblue', alpha=0.7)
axs[i].set_xlabel(feature_names[i])
axs[i].set_ylabel('Target Variable')
axs[i].set_title(f'Effect of {feature_names[i]}')
plt.tight_layout()
plt.show()
Mean Absolute Error (MAE): 41.38
# # Here is another block of working code generated by ChatGPT
# import numpy as np
# import pandas as pd
# from sklearn.datasets import load_diabetes
# from sklearn.model_selection import train_test_split
# from pygam import LinearGAM, s, f
# from sklearn.metrics import mean_absolute_error
# import matplotlib.pyplot as plt
# # Load the diabetes dataset
# diabetes = load_diabetes()
# # Assign features (X) and target variable (y)
# X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
# y = diabetes.target
# # Rename the columns
# column_names = {'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure',
# 's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL',
# 's4': 'Thyroid', 's5': 'Glaucoma', 's6': 'Glucose'}
# X.rename(columns=column_names, inplace=True)
# # Split the data into training and test sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# # Create a generalized additive model (GAM)
# model = LinearGAM(s(0) + f(1) + s(2) + s(3) + s(4) + s(5) + s(6) + s(7) + s(8) + s(9), n_splines=35).fit(X_train, y_train)
# # Make predictions on the test set
# y_pred = model.predict(X_test)
# # Evaluate the model's performance using mean absolute error (MAE)
# mae = mean_absolute_error(y_test, y_pred)
# print(f"Mean Absolute Error (MAE): {mae}")
# # Plot the effect of each patient feature on the target variable with 95% confidence intervals
# fig, axs = plt.subplots(2, X.shape[1] // 2, figsize=(15, 6))
# for i, feature in enumerate(X.columns):
# row = i // (X.shape[1] // 2)
# col = i % (X.shape[1] // 2)
# XX = model.generate_X_grid(term=i)
# axs[row, col].plot(XX[:, i], model.partial_dependence(term=i, X=XX))
# axs[row, col].plot(XX[:, i], model.partial_dependence(term=i, X=XX, width=.95)[1], c='r', ls='--')
# axs[row, col].set_title(feature)
# plt.tight_layout()
# plt.show()
Tasks:
Explain the features BMI and Blood Pressure (BP). Why is there a wider confidence interval around the lower and higher ends of BMI and BP?
What can you say about the impact of higher LDL values (or bad cholesterol) on the target variable?
from sklearn.datasets import load_diabetes
import pandas as pd
# Load the diabetes dataset
diabetes = load_diabetes()
# Create a pandas DataFrame from the dataset
column_names = {'age': 'Age', 'sex': 'Sex', 'bmi': 'BMI', 'bp': 'Blood Pressure',
's1': 'Total Cholesterol', 's2': 'LDL', 's3': 'HDL',
's4': 'Thyroid', 's5': 'Glaucoma', 's6': 'Glucose'}
df = pd.DataFrame(diabetes.data, columns=[column_names[col] for col in diabetes.feature_names])
df['Disease Progression'] = diabetes.target
print(df[(df['Total Cholesterol'] > 0.15) & (df['LDL'] > 0.19)])
Age Sex BMI Blood Pressure Total Cholesterol LDL \
123 0.005383 0.05068 0.034751 -0.001091 0.152538 0.198788
HDL Thyroid Glaucoma Glucose Disease Progression
123 -0.061809 0.185234 0.015568 0.07348 84.0